Ch. 12 - Trees and Ensembles

So far, we have covered neural networks only. However, there are many more machine learning models. The reason we have focused on neural nets (and will continue to do so throughout the material) is that they work well on many different tasks, from classifying images to generating music. On structured data, however, decision trees still make a strong showing. So in this chapter we will have a look at them, together with the powerful concept of ensembles: the combination of many models into one.

As always, before we start, let's load some basic libraries and the data:


In [1]:
import numpy as np
import pandas as pd
# Set seed for reproducibility
np.random.seed(42)
import matplotlib.pyplot as plt
# Suppress warnings for better readability
import warnings; warnings.simplefilter('ignore')

In [2]:
# Load data
df = pd.read_csv('processed_bank.csv',index_col=0)

In [3]:
# Check that data is okay
df.head()


Out[3]:
campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y job_admin. ... month_sep day_of_week_fri day_of_week_mon day_of_week_thu day_of_week_tue day_of_week_wed poutcome_failure poutcome_nonexistent poutcome_success contacted_before
34579 -0.68711 -2.555677e-13 0.976408 -0.758550 -0.928102 -1.122929 -0.900202 -0.418322 0 1 ... 0 0 0 1 0 0 1 0 0 0
446 -0.68711 -2.555677e-13 -0.452557 0.924213 0.806766 0.705671 0.998971 0.637509 1 0 ... 0 0 0 0 1 0 0 1 0 0
20173 -0.13552 -2.555677e-13 -0.452557 1.098292 -0.059880 0.761649 1.056089 1.063747 1 1 ... 0 0 1 0 0 0 0 1 0 0
18171 -0.13552 -2.555677e-13 -0.452557 1.098292 0.687012 -0.469858 1.055031 1.063747 1 1 ... 0 0 0 0 0 1 0 1 0 0
30128 -0.68711 -2.555677e-13 -0.452557 -0.758550 -0.641321 -1.290862 -0.847844 -0.418322 0 0 ... 0 0 0 1 0 0 0 1 0 0

5 rows × 65 columns


In [4]:
# Process data into train / dev / test
# X is everything that is not y
X = df.loc[:, df.columns != 'y'].values
# y is y
y = df['y'].values

# First split in train / test_dev
from sklearn.model_selection import train_test_split
X_train, X_test_dev, y_train, y_test_dev = train_test_split(X, y, test_size=0.25, random_state=0)

# Second split in dev / test
X_dev, X_test, y_dev, y_test = train_test_split(X_test_dev, y_test_dev, test_size=0.5, random_state=0)

# Remove test_dev set from memory
del X_test_dev
del y_test_dev

Decision trees

The best way to understand a decision tree is to look at one. This tree is what we will create a few lines from here:

As you can see, a decision tree starts out by splitting the dataset into two subsets on a single variable, in this case the employment variable. The goal is to arrive at subsets that are either uniformly 'yes' or uniformly 'no'. So in this case, the algorithm has determined that a certain threshold on employment is the best separator of the two classes. It then proceeds recursively until it has found perfectly uniform groups or gets stopped by a maximum depth parameter. A little more formally, a tree classifier tries to minimize Gini impurity across the subsets. Gini impurity is the probability that a randomly chosen item from a set would be incorrectly labeled if it received the label of another randomly chosen item from the same set. On a perfectly uniform set, this impurity is zero.
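To make this concrete: if $p_i$ is the share of class $i$ in a set, the Gini impurity is

$$G = \sum_i p_i (1 - p_i) = 1 - \sum_i p_i^2$$

Here is a minimal numpy sketch of this computation (purely for illustration; sklearn handles this internally):


def gini_impurity(labels):
    # Share of each class in the set
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Probability of drawing an item and mislabeling it with another random item's label
    return 1 - np.sum(p ** 2)

gini_impurity(np.array([1, 1, 1, 1]))  # 0.0 -> perfectly uniform
gini_impurity(np.array([0, 1, 0, 1]))  # 0.5 -> maximally mixed for two classes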

The big advantage of decision trees is that they are easy to interpret. You can just look at the tree and see which variables matter. The disadvantage is that they are somewhat weak classifiers: they struggle to learn complex functions without overfitting.

Decision trees can be trained with sklearn.


In [5]:
# Import the corresponding class
from sklearn.tree import DecisionTreeClassifier

To prevent overfitting, we usually need to set a maximum depth that we want to allow. A common choice is a depth of 6:


In [6]:
tree_classifier = DecisionTreeClassifier(max_depth=6)

Like all sklearn classifiers, DecisionTreeClassifier can be trained with .fit():


In [7]:
tree_classifier.fit(X=X_train,y=y_train); # ; suppresses the output of the cell for cleaner reading

We can test our tree on the dev set with the .score() function, which outputs the accuracy:


In [8]:
tree_classifier.score(X_dev,y_dev)


Out[8]:
0.72931034482758617

72.9% accuracy is not bad for a simple tree, especially one that was much faster to train than a neural net. However, it still lags behind the neural net. This is where a truly powerful idea comes into play: what if we combined many weak classifiers into a strong one?

Bagging

Bagging (short for bootstrap aggregating) builds on this idea. It not only trains multiple classifiers, it also trains each of them on a different random subset of the data, sampled with replacement (a bootstrap sample). This means that even if each classifier overfits 'its' subset, in aggregate they will not overfit the dataset.
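To make the bootstrap idea concrete, here is a minimal hand-rolled sketch (purely illustrative, using the training arrays from above; the number of models and the tree settings are arbitrary choices):


from sklearn.tree import DecisionTreeClassifier

n_models = 10
models = []
for _ in range(n_models):
    # Draw a bootstrap sample: same size as the training set, sampled with replacement
    idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
    model = DecisionTreeClassifier(max_depth=6)
    model.fit(X_train[idx], y_train[idx])
    models.append(model)

# Aggregate by majority vote over the individual predictions
votes = np.stack([m.predict(X_dev) for m in models])
majority_vote = (votes.mean(axis=0) >= 0.5).astype(int)
# (y_dev == majority_vote).mean() would give the accuracy of this hand-rolled ensemble

In practice we don't need to write this ourselves. Sklearn also has an implementation for bagging: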


In [9]:
# Import class
from sklearn.ensemble import BaggingClassifier

In [10]:
# Setup, needs the classifier we want to use and the number of classifiers
bagger = BaggingClassifier(base_estimator=tree_classifier, n_estimators=50)

In [11]:
# Fit will train the specified number of classifiers
bagger.fit(X_train,y_train);

In [12]:
bagger.score(X_dev,y_dev)


Out[12]:
0.73448275862068968

As you can see, many models do better than one. We can take this concept even further with a few more tricks that keep the decision trees we train from becoming too similar.

Random forests

Remember how a decision tree recursively splits the dataset on the one variable that leads to the most uniformity? In a random forest, we give the splitting algorithm access to only a random subset of the features at each step. This leads to a bigger variety of trees, as each branch of each tree has to work with different features. We can therefore avoid overfitting even more, while gaining performance by training more trees.
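The key trick is easy to picture: at every split, only a random subset of the columns is considered as split candidates. Here is a tiny illustrative sketch of that selection step (sklearn handles this internally, controlled by the max_features parameter of RandomForestClassifier):


n_features = X_train.shape[1]
# Consider only a random subset of features at this split,
# e.g. the square root of the total number of features
subset_size = int(np.sqrt(n_features))
candidate_features = np.random.choice(n_features, size=subset_size, replace=False)
# The split search would then only look at X_train[:, candidate_features]

Sklearn wraps all of this up in the RandomForestClassifier: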


In [13]:
from sklearn.ensemble import RandomForestClassifier

In [14]:
randomforest = RandomForestClassifier(max_depth=6,n_estimators=1000)

In [15]:
randomforest.fit(X_train,y_train);

In [16]:
randomforest.score(X_dev,y_dev)


Out[16]:
0.73965517241379308

The random forest outperforms both the single decision tree and the simple bagging approach. It also performs slightly better than the best neural net from Ch. 10. Their good performance and ease of training make random forests a popular choice for working with structured data.

Gradient boosting

Another very popular technique for structured data is gradient boosting. While gradient boosting theoretically works with any machine learning algorithm (neural networks, too), in practice it is mostly done with decision trees. Gradient boosting works by iteratively adding classifiers that aim to reduce the residuals of the previous model. Say we have a classifier $F_m(x)$. We then fit a residual classifier $h(x)$ to the residuals:

$$h(x) = y - F_m(x)$$

If $h$ fit these residuals perfectly, adding it to our original classifier would give a perfect model:

$$F_{m+1}(x)=F_m(x)+h(x)=y$$

In practice $h$ only approximates the residuals, but the combined model is still better than $F_m$ alone.

We can do this over and over, obtaining a better and better classifier. Take a look at how we compute $h(x)$ for a minute. It looks like the gradient of the loss function from our neural networks in week 1! And in fact it is: for a squared-error loss, the residual is exactly the (negative) gradient of the loss with respect to the model's output. Gradient boosting does a form of gradient descent, too! There is even a learning rate, which is called shrinkage in gradient boosting:

$$F_{m+1}(x)=F_m(x)+\alpha \cdot h(x)$$
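To see the mechanics in code, here is a minimal from-scratch sketch of the residual-fitting loop, using small regression trees as the base model (illustrative only; the names and settings are arbitrary, and real implementations add many refinements):


from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, shrinkage=0.1):
    # Start from a constant model: the mean of the targets
    prediction = np.full(len(y), y.mean(), dtype=float)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction                 # what F_m still gets wrong
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)                     # h(x) approximates the residuals
        prediction += shrinkage * tree.predict(X)  # F_{m+1} = F_m + alpha * h
        trees.append(tree)
    return trees

Each round nudges the combined model a little further toward the targets, which is why the shrinkage parameter plays the same role as the learning rate in gradient descent.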

The most popular implementation of gradient boosting is XGBoost. XGBoost uses decision trees as its base model and is designed to be highly scalable. It also features many more tricks that make gradient boosting work better. You don't have to know all the frills; the standard parameters usually work quite well.


In [17]:
# Get XGBoost and import the classifier
import xgboost as xgb
from xgboost import XGBClassifier



In [18]:
# Parameters:
# learning_rate is the shrinkage (alpha)
# max_depth is the maximum depth of each tree
xgclassifier = XGBClassifier(learning_rate=0.1,max_depth=3)

In [19]:
# Train classifier
xgclassifier.fit(X_train,y_train);

In [20]:
# Scoring works exactly as with sklearn
xgclassifier.score(X_dev,y_dev)


Out[20]:
0.74224137931034484

XGBoost achieves an accuracy of 74.2%, which is better than the best neural network we came up with. It also trained a lot faster than a neural net. This is why it is still very popular for structured data. However, it reaches its limits on unstructured data like text, images, or sound.

Stacking

Stacking is another ensemble technique. In stacking, we train a meta classifier that uses the outputs of other classifiers as its input features. This image shows how it is done to win Kaggle competitions:

Let's see how we can use stacking with the classifiers we have developed so far in this chapter. Sadly, there is no easy sklearn method we could use for stacking, so we will build our own. It is helpful to train the base and meta classifiers on different data to avoid overfitting, so first we will split our training data into a base training set and a meta training set:


In [21]:
# Split train set into meta and base training sets
X_base, X_meta, y_base, y_meta = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

We can then fit all of our base classifiers to the base training set:


In [22]:
tree_classifier.fit(X_base,y_base);
bagger.fit(X_base,y_base);
randomforest.fit(X_base,y_base);
xgclassifier.fit(X_base,y_base);

For good measure, we can also throw in our neural net from chapter 11:


In [23]:
import keras
from keras.models import load_model


Using TensorFlow backend.

In [24]:
neural_net = load_model('./support_files/Ch11_model.h5')

Now we create an input dataset for the meta classifier by letting our models make predictions on the meta training set. Note that we reshape all of the predictions to the same (n, 1) shape so that they can be stacked into a feature matrix:


In [25]:
# Get prediction from single tree classifier
treepred = tree_classifier.predict(X_meta).reshape(X_meta.shape[0],1)
# Get prediction from bagged tree classifier
baggerpred = bagger.predict(X_meta).reshape(X_meta.shape[0],1)
# Get prediction from random forest
forestpred = randomforest.predict(X_meta).reshape(X_meta.shape[0],1)
# Get prediction from XGBoost
xgpred = xgclassifier.predict(X_meta).reshape(X_meta.shape[0],1)
# Get prediction from neural net
nnpred = neural_net.predict(X_meta).reshape(X_meta.shape[0],1)

In [26]:
# Combine predictions into meta features
meta_features = np.stack((treepred,baggerpred,forestpred,xgpred,nnpred),axis=1).reshape(X_meta.shape[0],5)

We can then train the meta classifier; let's make it another XGBoost model here:


In [27]:
# Train the meta classifier
meta = XGBClassifier()
meta.fit(meta_features,y_meta);

To make predictions with the full stack, we will define a new function:


In [28]:
def make_predictions(X):
    # Get meta predictions
    treepred = tree_classifier.predict(X).reshape(X.shape[0],1)
    baggerpred = bagger.predict(X).reshape(X.shape[0],1)
    forestpred = randomforest.predict(X).reshape(X.shape[0],1)
    xgpred = xgclassifier.predict(X).reshape(X.shape[0],1)
    nnpred = neural_net.predict(X).reshape(X.shape[0],1)
    # Combine predictions
    meta_features = np.stack((treepred,baggerpred,forestpred,xgpred,nnpred),axis=1).reshape(X.shape[0],5)
    # Make meta predictions
    meta_pred = meta.predict(meta_features)
    return meta_pred

Lacking a .score() method, we can measure the accuracy with sklearn's accuracy_score function:


In [29]:
from sklearn.metrics import accuracy_score

In [30]:
# Make predictions
predictions = make_predictions(X_dev)

In [31]:
# Turn predictions into definite class predictions
predictions[predictions >= 0.5] = 1
predictions[predictions < 0.5] = 0

# Measure accuracy
accuracy_score(y_dev,predictions)


Out[31]:
0.73103448275862071

As you can see, our stacked model does worse than our best single model (XGBoost). This does happen. But in general, stacking can make a model more robust and increase accuracy. There are many more interesting aspects to stacking that we will not cover in this material but that you might want to think about. What happens, for example, if you train your base models on specific subsets of the data? Say you create a base model that is only trained on young people from your dataset; how would you create a meta model that takes this into account? How many models are practical? We could generate thousands of models, but would it be worth the computational cost? And what kind of single model could capture the complexity of our overall function as well as the ensemble can, and how could we prevent it from overfitting?

Summary

In this chapter, you have seen some popular ensemble techniques and learned about decision trees. These tools are quite useful: together with neural nets, they are more or less the standard recipe for winning machine learning competitions. Trying out a few of them might well help you in this week's competition. Good luck!